Abstract

Question Answering (QA) is the task of answering natural language questions with adequate sentences. This paper
proposes two methods to improve the performance of a QA system using a Q&A site corpus. The first method is for
the relevant document retrieval module. We propose a modification of the mutual information measure used for query
expansion: we calculate it between two words in each question and a word in its answer in the Q&A site corpus, so
that words unsuitable for query expansion are not chosen.

The second method is for the candidate answer evaluation module. We propose to evaluate candidate answers
using two measures together, i.e., the Web relevance score and the translation probability. The experiments were
carried out using a Japanese Q&A site corpus. They revealed that the first proposed method was significantly better
than the original method when their accuracies and MRR (Mean Reciprocal Rank) were compared, and that the second
method was significantly better than the original methods when their MRR were compared.



Introduction 

Question Answering (QA) is the task of answering questions written in natural language with adequate sentences.
A QA system consists of the following four modules (Soricut and Brill 2006).

(1) Question analysis 

(2) Relevant document retrieval 

(3) Candidate answer extraction 

(4) Candidate answer evaluation 

When a question written in natural language is input into the system, the system first extracts keywords in the
question analysis module. Next, in the relevant document retrieval module, the system retrieves relevant documents
using the keywords obtained in the previous module. After that, the system extracts candidate answers in the
candidate answer extraction module. The size of a candidate answer varies according to the question type, e.g., a
phrase or a sentence. A sentence or a paragraph will be the candidate answer when the question is non-factoid.
Finally, in the candidate answer evaluation module, the system estimates the quality of the candidate answers
obtained in the candidate answer extraction module.

Correspondence: kkomiya@cc.tuat.ac.jp
Institute of Engineering, Tokyo University of Agriculture and Technology, 2-24-16 Nakacho, Koganei, Tokyo, 184-8588, Japan
Full list of author information is available at the end of the article

This paper proposes two methods to improve the performance of a QA system using a Q&A site corpus. The first
method is for the relevant document retrieval module. We propose a modification of the mutual information measure
used for query expansion. Query expansion is an approach that extends the query by adding new words that are not
included in the question, in order to improve the quality of the relevant documents to be retrieved. In previous
work, the words to be added were chosen based on the mutual information between a word in each question and a word
in its answer in the Q&A site corpus (Berger et al. 2000). We instead calculate it between two words in each
question and a word in its answer in the Q&A site corpus, so that words unsuitable for query expansion are not
chosen.

The second method is for the candidate answer evaluation module. In this module, the QA system estimates the
quality of the candidate answers that were obtained by the document retrieval. This module is important because it
directly affects the system's outputs. There are two cues for evaluating candidate answers, i.e., 1) the topic
relevance, which evaluates the association between each candidate answer and its question in terms of content, and
2) the writing style, which evaluates how well the writing style of each candidate answer corresponds to the
question type. In this paper, we propose to evaluate candidate answers using the Web relevance score (Ishioroshi
et al. 2009) and the translation probability (Soricut and Brill 2006) together.

We will show, through experiments using a Japanese Q&A site corpus, that our proposed methods improved each
module.

This paper is organized as follows. Section 'Related 
work' reviews related work on QA. Sections 'Query 
expansion using mutual information' and 'Query expan- 
sion using two words in a question' explain how words 
for query expansion were determined in the relevant doc- 
ument retrieval module in the previous work (Berger 
et al. 2000) and the first proposed method for the mod- 
ule, respectively. Sections 'Candidate answer evaluation' 
and 'Candidate answer evaluation with web relevance 
score and translation probability' describe how candidate 
answers were evaluated in the candidate answer module 
in the previous work (Ishioroshi et al. 2009 and Soricut 
and Brill 2006) and the second proposed method for the 
module. Section 'Experiments' explains the experimental 
settings. We present the results in Section 'Results' and 
discuss them in Section 'Discussion'. Finally, we conclude 
the paper in Section 'Conclusion'. 

Related work 

Question Answering (QA), which involves answering questions written in natural language with adequate sentences,
has been studied intensively in recent years both within and outside the area of natural language processing. The
QA systems within the area are sometimes called open domain question answering systems because they are not
domain specific (Ishioroshi et al. 2009).

The types of questions treated by QA systems can be categorized into two kinds, i.e., factoid and non-factoid.
Questions of the former type ask for the names of people or places, or for quantities, e.g., "How tall is
Mt. Fuji?". Questions of the latter type, on the other hand, ask for definitions, reasons, or methods, e.g., "What
are ES cells?". Our system treats both types of questions in this paper.

We propose two methods to improve the performance of the QA system; the first method is for the query expansion
of the relevant document retrieval module and the second method is for the candidate answer evaluation module.
For the query expansion, Saggion and Gaizauskas (2004) proposed to obtain words for the query expansion using
relevance feedback from the Web. They regarded words that appeared frequently in documents retrieved for each
question query as the new words for the query expansion. Mori et al. (2007) and Derczynski et al. (2008) used
tf-idf, and Lin et al. (2010) used Okapi BM25, as the criterion instead of the term frequency of Saggion and
Gaizauskas (2004). Lao et al. (2008) proposed to obtain synonyms of the words in each question using a bootstrap
method and to use them for the query expansion. Saggion and Gaizauskas (2004) also used synonyms but obtained
them from a dictionary. Liu et al. (2010) obtained them from Wikipedia. Finally, Berger et al. (2000) proposed to
learn, from a Q&A site corpus, what kinds of words tend to appear in answers when certain words appear in
questions, and to use the words that frequently appear for the query expansion. In this paper, we improve one of
the approaches suggested by Berger et al. (2000).

For the candidate answer evaluation, some researchers such as Higashinaka and Isozaki (2007, 2008) and Isozaki
and Higashinaka (2008) reported that the performance of the system improved when the questions were classified in
advance into types such as "how-questions" and "why-questions". However, Ishioroshi et al. (2009) and Soricut and
Brill (2006) developed QA systems without classification of the question types. Ishioroshi et al. (2009)
estimated the topic relevance by relevance feedback from the Web.

Soricut and Brill (2006) and Berger et al. (2000) treated the QA task as translation and succeeded in evaluating
the topic relevance and the writing style simultaneously. We improve on these methods by combining them, also
without classification of the question types.

Query expansion using mutual information 

Berger et al. (2000) proposed to learn what kind of words 
tend to appear in answers when some words appeared in 
questions using a Q&A site corpus and to use words that 
frequently appear for the query expansion. In their work, 
mutual information was used to measure the degree of rel- 
evance between a word in each question and a word in its 
answer. The formula of mutual information is as follows: 



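$$
\begin{aligned}
I(W_q, W_a) ={} & P(w_q, w_a) \log \frac{P(w_q, w_a)}{P(w_q)P(w_a)}
 + P(w_q, \bar{w}_a) \log \frac{P(w_q, \bar{w}_a)}{P(w_q)P(\bar{w}_a)} \\
& + P(\bar{w}_q, w_a) \log \frac{P(\bar{w}_q, w_a)}{P(\bar{w}_q)P(w_a)}
 + P(\bar{w}_q, \bar{w}_a) \log \frac{P(\bar{w}_q, \bar{w}_a)}{P(\bar{w}_q)P(\bar{w}_a)},
\end{aligned}
\tag{1}
$$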



where $W_q$ and $W_a$ represent binary random variables that show whether a word $w_q$ appears in each question
and whether a word $w_a$ appears in its answer, respectively:

$$
W_q = \begin{cases} w_q & (w_q \text{ appears in a question}) \\ \bar{w}_q & (w_q \text{ does not appear in a question}) \end{cases}
\tag{2}
$$

$$
W_a = \begin{cases} w_a & (w_a \text{ appears in an answer}) \\ \bar{w}_a & (w_a \text{ does not appear in an answer}) \end{cases}
\tag{3}
$$



The more $w_q$ and $w_a$ co-occur in a corpus, the greater their mutual information becomes.

For every word in each question, Berger et al. (2000) chose one word from its answer: the word that maximized
the mutual information between the question word and the answer word. Hereafter, {a word in a question → a word
in an answer} denotes query expansion using this method.
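
As a minimal sketch of Eq. (1) and of the word selection described above, the following Python code estimates the probabilities by relative frequencies over the Q&A pairs. The estimator, the function names `mutual_information` and `best_expansion_word`, and the representation of each Q&A pair as two sets of words are assumptions made for illustration; they are not taken from Berger et al. (2000).

```python
import math
from collections import Counter


def mutual_information(n_q, n_a, n_qa, n_total):
    """Mutual information of two binary events, estimated from counts.

    n_q: number of Q&A pairs whose question contains w_q
    n_a: number of Q&A pairs whose answer contains w_a
    n_qa: number of Q&A pairs where both hold
    n_total: total number of Q&A pairs
    """
    mi = 0.0
    # Sum over the four cells of the 2x2 contingency table.
    for q_present in (True, False):
        for a_present in (True, False):
            if q_present and a_present:
                joint = n_qa
            elif q_present:
                joint = n_q - n_qa
            elif a_present:
                joint = n_a - n_qa
            else:
                joint = n_total - n_q - n_a + n_qa
            if joint == 0:
                continue  # the term 0 * log 0 is taken as 0
            p_joint = joint / n_total
            p_q = (n_q if q_present else n_total - n_q) / n_total
            p_a = (n_a if a_present else n_total - n_a) / n_total
            mi += p_joint * math.log(p_joint / (p_q * p_a))
    return mi


def best_expansion_word(question_word, qa_pairs):
    """Return the answer word that maximizes MI with question_word.

    qa_pairs is a list of (question_words, answer_words) pairs of sets.
    """
    n_total = len(qa_pairs)
    n_q = 0
    answer_counts = Counter()
    joint_counts = Counter()
    for q_words, a_words in qa_pairs:
        has_q = question_word in q_words
        n_q += has_q
        for w in set(a_words):
            answer_counts[w] += 1
            joint_counts[w] += has_q
    scores = {
        w: mutual_information(n_q, answer_counts[w], joint_counts[w], n_total)
        for w in answer_counts
    }
    return max(scores, key=scores.get)
```

Note that mutual information is also high for words that are strongly negatively associated with the question word, so a practical implementation might additionally require that the two words actually co-occur.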

This method works effectively when the training and test corpora are domain specific^a. However, it sometimes
causes semantic drift when the corpora are large and not domain specific. For example, when the question was
"What are the connections between Softbank and yahoo?", it gave us the following results: {softbank → hawks}^b
and {yahoo → mail}. Hawks and mail are relevant to softbank and yahoo, respectively, but they should not be used
for the query expansion because they have no relevance to the original question.

Query expansion using two words in a question 

In order to alleviate the semantic drift, we propose to use 
mutual information based on two words in each question 
and a word in its answer. The new equation of mutual 
information is as follows: 

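$$
\begin{aligned}
I(W_{q1}, W_{q2}, W_a) ={} & P(w_{q1}, w_{q2}, w_a) \log \frac{P(w_{q1}, w_{q2}, w_a)}{P(w_{q1}, w_{q2})P(w_a)} \\
& + P(w_{q1}, w_{q2}, \bar{w}_a) \log \frac{P(w_{q1}, w_{q2}, \bar{w}_a)}{P(w_{q1}, w_{q2})P(\bar{w}_a)} \\
& + P(\overline{w_{q1}, w_{q2}}, w_a) \log \frac{P(\overline{w_{q1}, w_{q2}}, w_a)}{P(\overline{w_{q1}, w_{q2}})P(w_a)} \\
& + P(\overline{w_{q1}, w_{q2}}, \bar{w}_a) \log \frac{P(\overline{w_{q1}, w_{q2}}, \bar{w}_a)}{P(\overline{w_{q1}, w_{q2}})P(\bar{w}_a)},
\end{aligned}
\tag{4}
$$

where $\overline{w_{q1}, w_{q2}}$ denotes the case in which $w_{q1}$ and $w_{q2}$ do not both appear in the
question.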

It represents the degree of co-occurrence between two words in a question and a word in its answer. As with
Eq. (1), the more $w_{q1}$, $w_{q2}$, and $w_a$ co-occur in a corpus, the greater their mutual information
becomes. For example, when the question was "What are the connections between Softbank and yahoo?", it gave us
the following result: {softbank and yahoo → subsidiary}.
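
The selection with two question words can be sketched in the same way. The code below is an illustration under stated assumptions, not our exact implementation: the pair of question words is treated as a single binary event ("both words appear in the question"), probabilities are estimated by relative frequencies, words already in the question are skipped, and the names `binary_mi` and `expand_with_pairs` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def binary_mi(n_x, n_y, n_xy, n_total):
    """Mutual information of two binary events from marginal and joint counts."""
    cells = (
        (n_x, n_y, n_xy),                                            # x and y
        (n_x, n_total - n_y, n_x - n_xy),                            # x, not y
        (n_total - n_x, n_y, n_y - n_xy),                            # not x, y
        (n_total - n_x, n_total - n_y, n_total - n_x - n_y + n_xy),  # neither
    )
    mi = 0.0
    for x_count, y_count, xy_count in cells:
        if xy_count > 0:  # the term 0 * log 0 is taken as 0
            mi += (xy_count / n_total) * math.log(
                xy_count * n_total / (x_count * y_count)
            )
    return mi


def expand_with_pairs(question_words, qa_pairs, top_k=3):
    """For each pair of question words, pick the answer word with maximal MI,
    then return at most top_k of the picked words in descending order of MI."""
    question_set = set(question_words)
    n_total = len(qa_pairs)
    best_scores = {}
    for w1, w2 in combinations(sorted(question_set), 2):
        n_pair = 0
        answer_counts, joint_counts = Counter(), Counter()
        for q_words, a_words in qa_pairs:
            both = w1 in q_words and w2 in q_words
            n_pair += both
            for w in set(a_words) - question_set:
                answer_counts[w] += 1
                joint_counts[w] += both
        if n_pair == 0 or not answer_counts:
            continue  # the pair never occurs together in a question
        scores = {
            w: binary_mi(n_pair, answer_counts[w], joint_counts[w], n_total)
            for w in answer_counts
        }
        best = max(scores, key=scores.get)
        best_scores[best] = max(scores[best], best_scores.get(best, 0.0))
    return sorted(best_scores, key=best_scores.get, reverse=True)[:top_k]
```

Because the question-side event requires both words, an answer word such as hawks, which co-occurs with softbank alone, receives a lower score than a word that co-occurs with softbank and yahoo together.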

Candidate answer evaluation 

The second proposed method is for the candidate answer evaluation. As mentioned above, the topic relevance and
the writing style are used to estimate the quality of candidate answers. We introduce two existing methods for
the module. The first method is the work proposed by Ishioroshi et al. (2009), which estimated the topic
relevance by relevance feedback from the Web. They regarded words that frequently appeared in documents retrieved
for each question query as relevant words. Therefore, candidate answers that contain many relevant words were
regarded as better in terms of the topic relevance.

The relevant words are obtained as follows:

(1) Make a keyword class K that contains content words 
(i.e., nouns, verbs, and adverbs) in each question 

(2) Choose three words from K in all combinations and 
search the Web by them 

(3) Obtain at most 100 Web snippets, i.e., summaries of 
the Web documents that were obtained by a Web 
search engine, for each query 

Each content word $w_j$ in these snippets is treated as a relevant word for the question. The relevance degree
of the relevant word, $T(w_j)$, is defined by the following equation:

$$
T(w_j) = \max_i \frac{freq(w_j, i)}{n_i},
\tag{5}
$$

where $i$ is the index of a query (i.e., a triple of content words), $n_i$ denotes the number of snippets
obtained from the $i$-th query, and $freq(w_j, i)$ denotes the number of snippets obtained from the $i$-th query
that contain the word $w_j$.
The candidate answer evaluation score in terms of the topic relevance, $Web\_relevance(Q, A)$, is defined as the
sum of the relevance degrees of the relevant words contained in each candidate answer:

$$
Web\_relevance(Q, A) = \sum_{j=1}^{l} T(w_j),
\tag{6}
$$

where $Q$ represents a question, $A$ represents its candidate answer, $l$ denotes the number of words in the
candidate answer, and $w_j$ denotes each word in the candidate answer.
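
The computation of Eqs. (5) and (6) can be sketched as follows. This is an illustrative sketch only: the Web search itself is not shown, each snippet is assumed to be already tokenized into content words, and the names `keyword_triples`, `relevance_degrees`, and `web_relevance` are hypothetical.

```python
from itertools import combinations


def keyword_triples(content_words):
    """All queries made of three content words of the question."""
    return list(combinations(content_words, 3))


def relevance_degrees(snippets_per_query):
    """T(w) = max over queries i of freq(w, i) / n_i.

    snippets_per_query[i] is the list of snippets for the i-th query,
    each snippet being a list of content words.
    """
    degrees = {}
    for snippets in snippets_per_query:
        n_i = len(snippets)
        if n_i == 0:
            continue  # the query returned no snippet
        freq = {}
        for snippet_words in snippets:
            for w in set(snippet_words):  # count each snippet at most once
                freq[w] = freq.get(w, 0) + 1
        for w, count in freq.items():
            degrees[w] = max(degrees.get(w, 0.0), count / n_i)
    return degrees


def web_relevance(answer_words, degrees):
    """Sum of the relevance degrees of the words in a candidate answer."""
    return sum(degrees.get(w, 0.0) for w in answer_words)
```

Words that never appear in any snippet contribute zero, so a candidate answer is rewarded only for the relevant words it contains.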

Finally, Ishioroshi et al. (2009) evaluated candidate answers using the following score, which takes into
consideration the writing style as well as the topic relevance:

$$
Score(S_i) = \frac{1}{\ln(1 + length(S_i))} \left\{ \sum_{j=1}^{l} T(w_{ij}) \right\}^{1-\gamma} \left\{ \sum_{k=1}^{m} \chi^2(b_{ik}) \right\}^{\gamma},
\tag{7}
$$

where $l$ denotes the number of types of word $w_{ij}$ in a sentence $S_i$, $m$ represents the number of types of
writing feature $b_{ik}$ in $S_i$, $length(S_i)$ is the number of characters in $S_i$, $\chi^2$ denotes the score
of each writing style, and $\gamma$ represents the weighting parameter. As for $\chi^2$, a chi-square value is
calculated between the answers that include the writing feature $b_{ik}$ and the top $N$ answers retrieved for
the question query.

The second method is the work by Soricut and Brill (2006), which treated the QA task as translation. They
succeeded in evaluating the topic relevance and the writing style simultaneously. In their method, each question
and its answer are regarded as the source and target sentences, respectively. Word-by-word translation
probabilities are learned from a Q&A site corpus. When a question is input into the system, the system calculates
the translation probabilities from the question into its candidate answers, and the candidate answers are then
evaluated using these probabilities. They used IBM Model 1 (Brown et al. 1993) as the translation model, which is
simple but has proven effective in many tasks. The answer evaluation is formulated as follows using IBM Model 1
(as in Berger et al. (2000)):

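$$
A^* = \arg\max_A P(A|Q) = \arg\max_A P(Q|A)P(A)
\tag{8}
$$

$$
P(Q|A) = \epsilon \prod_{j=1}^{m} \left( \frac{l}{l+1} \sum_{i=1}^{l} P(q_j|a_i)\, c(a_i|A) + \frac{1}{l+1} P(q_j|\mathrm{NULL}) \right),
\tag{9}
$$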

where $A^*$ represents the most adequate candidate answer, $Q\,(= q_1, q_2, \ldots, q_m)$ and
$A\,(= a_1, a_2, \ldots, a_l)$ represent a question and its candidate answer, $m$ and $l$ denote the numbers of
words in the question and its candidate answer^c, $P(q|a)$ represents the translation probability from a word $a$
in an answer to a word $q$ in a question, $c(a_i|A)$ is the relative count of the answer word $a_i$, $P(A)$
denotes the generation probability of the candidate answer $A$, and $\epsilon$ is the probability of generating a
question of length $m$ from the candidate answer.

We obtain equation (10) by assuming that $c(a_i|A)$ is $1/l$, as in Brown et al. (1993):

$$
P(Q|A) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=1}^{l+1} P(q_j|a_i),
\tag{10}
$$

where $a_{l+1}$ stands for NULL.

Equation (10) has the problem that the fewer words a candidate answer contains, the higher its translation
probability becomes, because the value of the coefficient increases as $l$ decreases. Therefore, we neglected the
coefficient and used equation (11) instead of equation (10):

$$
P(Q|A) \approx \prod_{j=1}^{m} \sum_{i=1}^{l+1} P(q_j|a_i).
\tag{11}
$$
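
A minimal sketch of Eq. (11) is given below. It works in log space to avoid numerical underflow for long questions. The dictionary `t_prob`, which maps a pair (question word, answer word) to $P(q|a)$, the `"NULL"` key, and the small `floor` value used when no translation probability is known are assumptions for illustration; in practice the probabilities would come from a trained translation table.

```python
import math


def log_translation_score(question, answer, t_prob, floor=1e-10):
    """Log of prod_j sum_i P(q_j | a_i), with NULL appended to the answer.

    question, answer: lists of words
    t_prob: dict mapping (q, a) to the translation probability P(q | a)
    floor: hypothetical lower bound used when a question word has no
           known translation from any answer word
    """
    sources = list(answer) + ["NULL"]
    total = 0.0
    for q in question:
        prob = sum(t_prob.get((q, a), 0.0) for a in sources)
        total += math.log(max(prob, floor))
    return total
```

Since the coefficient of Eq. (10) is dropped, the score no longer rewards a candidate answer merely for being short.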



Candidate answer evaluation with web relevance 
score and translation probability 

When evaluating the topic relevance, the method using the translation probability proposed by Soricut and Brill
(2006) can flexibly capture synonyms. This is because the translation probabilities are learned beforehand from
the massive number of examples in a Q&A site corpus. However, it is unable to capture the co-occurrence
information of several words in a question because it only utilizes word-to-word translation probabilities. By
contrast, the Web relevance score proposed by Ishioroshi et al. (2009) can capture the co-occurrence information
but cannot capture synonyms because the set of Web documents obtained dynamically is small. Thus, an answer
evaluation method that uses the two methods simultaneously should be able to achieve better performance.

New answer evaluation formula 

Equation (12) is the new formula of the answer evalua- 
tion that uses the Web relevance score and the translation 
probability: 

$$
EvalScore(Q, A) = \mathcal{P}(Q, A)^{1-\gamma} \cdot Web\_relevance(Q, A)^{\gamma}
\tag{12}
$$

$$
\mathcal{P}(Q, A) = P(Q|A)P(A),
\tag{13}
$$

where $\mathcal{P}(Q, A)$ represents the probability that is maximized in equation (8) (the score by Soricut and
Brill (2006)), $Web\_relevance(Q, A)$ denotes the Web relevance score (the score by Ishioroshi et al. (2009)),
and $\gamma$ represents the weighting parameter. Equation (12) is equivalent to the translation probability when
$\gamma = 0$, whereas it is the same as the Web relevance score when $\gamma = 1$.
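
Equation (12) can be sketched in log space as below; taking the logarithm does not change the ranking of the candidate answers because the logarithm is monotonic. The function names `eval_score` and `rank_candidates`, the tuple layout of the candidates, and the treatment of a zero Web relevance score are assumptions for illustration.

```python
import math


def eval_score(log_p, web_relevance, gamma):
    """Log of P(Q, A)^(1 - gamma) * Web_relevance(Q, A)^gamma.

    log_p: log of P(Q|A)P(A) for the candidate answer
    web_relevance: Web relevance score of the candidate answer
    gamma: weighting parameter between 0 and 1
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be between 0 and 1")
    if gamma == 0.0:
        return log_p  # translation probability only
    if web_relevance <= 0.0:
        return float("-inf")  # assumed: no relevant word means the lowest score
    return (1.0 - gamma) * log_p + gamma * math.log(web_relevance)


def rank_candidates(candidates, gamma, top_k=5):
    """candidates: list of (answer, log_p, web_relevance) tuples."""
    ranked = sorted(
        candidates, key=lambda c: eval_score(c[1], c[2], gamma), reverse=True
    )
    return [answer for answer, _, _ in ranked[:top_k]]
```

With `gamma = 0` the ranking follows the translation probability alone, and with `gamma = 1` it follows the Web relevance score alone, which mirrors the two limiting cases of Eq. (12).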

Experiments 

Two kinds of experiments were carried out using a Japanese Q&A site corpus, with the 100 questions of
"NTCIR-ACLIA2" (Mitamura et al. 2010) as the test questions. The "Yahoo! Chiebukuro" data were used as the Q&A
site corpus for the calculation of mutual information and for the training of the translation probability. The
"Yahoo! Chiebukuro" data are distributed to researchers by the National Institute of Informatics based on a
contract with the Yahoo Japan Corporation (National Institute of Informatics 2009). "Yahoo! Chiebukuro" is the
largest knowledge retrieval service in Japan, and the Yahoo Japan Corporation has been providing this service
since April 2004. Its aim is to connect people who want to ask questions with those who want to answer them, and
to share wisdom and knowledge among the participants. The National Institute of Informatics provides data
consisting of 3.11 million questions and 13.47 million answers (total text size of 1.6 billion characters)
submitted between April 2004 and October 2005, out of about 10 million questions and about 35 million answers
currently stored. The 100 questions of "NTCIR-ACLIA2" are included in the NTCIR-8 ACLIA test collection (National
Institute of Informatics 2012). This test collection includes 100 Japanese topics on Mainichi Newspaper articles,
which consist of 377,941 documents from the years 2004 and 2005. It can be used for experiments on Complex
Question Answering.

Only morphological analysis was carried out in the question analysis module, although some works such as Oda and
Akiba (2009) and Mizuno et al. (2007) classified question types there. ChaSen (Kyoto University and NTT 2013) was
used as the morphological analyzer and the Yahoo! API (Yahoo Japan Corporation 2013) was used as the search
engine. A candidate answer is always a sentence since we did not classify the question type. Web documents were
retrieved for a query consisting of all the content words of the question, with or without query expansion, and
were used as the source of the candidate answers. Thus, we did not have tagged answers.

Experiments of query expansion 

The experiments were carried out as follows for each question. First, words were chosen from the answers of the
Q&A corpus as candidates for the query expansion. For the system with the proposed method, one word was chosen
for every combination of two words in the question. By contrast, for the system with the original method, one
word was chosen for every word in the question. Next, at most the top three words in descending order of mutual
information were chosen as the words to be added for the query expansion. Finally, the candidate answers were
retrieved and evaluated.



Figure 1 Outlines of Web document retrieval and candidate answer evaluation with and without query expansion.
D1 ∪ D2 and D3 represent the candidate answers with and without query expansion, respectively. The number of
documents in D1 ∪ D2 was equalized to that of D3 for a fair comparison.

Figure 1 shows the outlines of the Web document retrieval and the candidate answer evaluation of the systems with
and without query expansion. For the system with query expansion, the documents were retrieved from the Web
twice: once using all content words, and once using all content words together with all new words for query
expansion. The candidate answers were collected from the two sets of documents retrieved by the system. For the
system without query expansion, on the other hand, the documents were retrieved from the Web only once, using all
content words, and the candidate answers were collected from them. In Figure 1, D1 is a subset of the documents
retrieved by a query with query expansion, and D2 and D3 are subsets of the documents retrieved by a query
without query expansion. D1 ∪ D2 and D3 represent the candidate answers with and without query expansion,
respectively. The number of documents in D1 ∪ D2 was equalized to that of D3 for a fair comparison; we set it to
80 documents. The score proposed by Ishioroshi et al. (2009) (the score of Eq. (7)) was used for the candidate
answer evaluation. The weighting parameter γ was set to 0.5. Unigrams were used as the feature of the writing
style.

Experiments of candidate answer evaluation 

GIZA++ (Casacuberta and Vidal 2007), which is an implementation of IBM Model 1, was used as the learning tool for
the translation probability. The number of iterations of the EM algorithm was set to five. Examples in the Q&A
site corpus whose question or answer contained more than 60 words were removed beforehand because they contained
too many words and negatively affected the learning of word alignment. For the same reason, examples in which the
number of words in the question was more than five times that in the answer, or vice versa, were also removed. As
a result, 1,092,144 examples in the "Yahoo! Chiebukuro" data were used as the training data of GIZA++.
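
The two filtering rules above can be sketched as a simple predicate. The function name `keep_pair`, the parameter names, and the decision to drop pairs with an empty question or answer are assumptions for illustration.

```python
def keep_pair(question_words, answer_words, max_len=60, max_ratio=5):
    """Return True if a Q&A pair should be kept as training data.

    A pair is dropped when either side has more than max_len words, or when
    one side has more than max_ratio times as many words as the other.
    """
    n_q, n_a = len(question_words), len(answer_words)
    if n_q == 0 or n_a == 0:
        return False  # assumed: empty sides cannot be aligned
    if n_q > max_len or n_a > max_len:
        return False
    if n_q > max_ratio * n_a or n_a > max_ratio * n_q:
        return False
    return True
```

For example, a pair with a 2-word question and a 10-word answer is kept, whereas a pair with a 2-word question and an 11-word answer is removed.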

Fifty Web documents retrieved for a query without 
query expansion were chosen as the candidate answers 
and were evaluated by the proposed or original formula of 
the answer evaluation. 

The bigrams normalized by the number of words were used for $P(A)$.

Results 

Each candidate answer retrieved from Web documents was evaluated in the answer evaluation module, and the QA
system output the top-5 answers. The outputs of the system were checked manually. The top-5 accuracy and the MRR
(Mean Reciprocal Rank) of the QA system were evaluated. For the top-5 accuracy, a question is regarded as
answered correctly if a correct answer is among the top-5 answers. The top-5 accuracy is formulated as follows:

$$
top\text{-}5\_accuracy = \frac{answered\_question}{N},
\tag{14}
$$

where $answered\_question$ is the number of questions for which the system output a correct answer in the top-5
answers and $N$ is the number of questions. MRR is formulated as follows:

$$
MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{rank(i)},
\tag{15}
$$

where $rank(i)$ represents the best rank of the correct answer of the $i$-th question. MRR takes into
consideration the rank of the output, whereas the top-5 accuracy does not.
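
The two measures can be sketched as follows. The sketch assumes that, for each question, the best rank of a correct answer among the outputs is known (or `None` when no output is correct), and that a question without a correct answer in the top-5 outputs contributes zero to the MRR; the function names are hypothetical.

```python
def top5_accuracy(best_ranks, k=5):
    """Fraction of questions with a correct answer within the top k outputs.

    best_ranks[i] is the 1-based best rank of a correct answer for the
    i-th question, or None if no output was correct.
    """
    answered = sum(1 for r in best_ranks if r is not None and r <= k)
    return answered / len(best_ranks)


def mean_reciprocal_rank(best_ranks, k=5):
    """Mean of 1 / rank(i); questions without a correct answer add zero."""
    total = sum(1.0 / r for r in best_ranks if r is not None and r <= k)
    return total / len(best_ranks)
```

For instance, with best ranks 1, 2, none, and 5 for four questions, three questions count as answered, while the reciprocal ranks 1, 1/2, 0, and 1/5 are averaged for the MRR.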

Results of query expansion 

Table 1 shows the top-5 accuracies and MRR of the exper- 
iments of the query expansion. The original method in the 
table represents the method of Berger et al. (2000), where 
words to be added are chosen based on mutual informa- 
tion between a word from each question and another word 
from its answer. This table shows the system with the pro- 
posed method outperformed the system without query 
expansion and the system with the method of Berger et al. 
(2000). It also showed that the system with the method of 
Berger et al. (2000) is inferior to the system without query 
expansion. We think this is because the large corpus we 
used caused the semantic drift. Thus, we think the method 
of Berger et al. (2000) is unsuitable for the open-domain 
QA. 

On the other hand, the proposed method can choose words to be added for the query expansion while alleviating the
semantic drift, because it considers the co-occurrence of two words, rather than only one word, from each
question with a word from its answer. According to a Wilcoxon signed-rank test, the difference between the
original method and the proposed method was significant, though the difference between the system without query
expansion and the proposed method was not. The significance level was 0.05.



Table 1 Results of experiments of query expansion

            Without query expansion    Original method    Proposed method
Accuracy    0.420                      0.400              0.450
MRR         0.262                      0.233              0.273

This table summarizes the top-5 accuracies and MRR of the systems for the
experiments of the query expansion. Original method in the table represents the
method proposed by Berger et al. (2000), where the words to be added are
chosen based on mutual information between a word from a question and
another word in its answer. This table indicates that the system with the
proposed method outperformed the two systems: the system without query
expansion and the system with the method proposed by Berger et al. (2000).

Figure 2 Performance of proposed method. This figure shows the top-5 accuracies and MRR of the experiments of the
candidate answer evaluation when the value of γ changed from 0 to 1. The top-5 accuracy was maximized to 0.59
when γ = 0.93 and the MRR was maximized to 0.461 when γ = 0.98.



Results of candidate answer evaluation 

Figure 2 shows the top-5 accuracies and MRR of the experiments of the candidate answer evaluation when the value
of γ changed from 0 to 1. Table 2 lists the performances of the original methods and the proposed method. Table 2
shows that the top-5 accuracy was maximized to 0.59 when γ = 0.93. In addition, the MRR was maximized to 0.461
when γ = 0.98. As for the MRR, the proposed method was significantly better than the original methods according
to a Wilcoxon signed-rank test. The significance level was 0.05.

Table 2 Results of experiments of answer candidate evaluation

                                        Top-5 accuracy    MRR
Only Web relevance (γ = 1)              0.55              0.423
Only translation probability (γ = 0)    0.49              0.318
Proposed method (γ = 0.93)              0.59              0.395
Proposed method (γ = 0.98)              0.57              0.461

This table summarizes the top-5 accuracies and MRR of the systems for the
experiments of the candidate answer evaluation. As for MRR, the proposed
method was significantly better than the original methods according to a
Wilcoxon signed-rank test.

Discussion 

We will show examples of the results and discuss them in 
this section. 

Query expansion 

Examples A, B, and C are taken from the experimental results.

Example A Some examples where the newly found word was the answer to the question could be found.

Question

"What are the connections between Softbank and yahoo?"

Query expansion via the original method

{softbank → hawks} and {yahoo → mail}

Query expansion via the proposed method

{softbank and yahoo → subsidiary}

Example A shows that the direct answer to the question was selected as a word for the query expansion via the
proposed method. Even if the word subsidiary were not the direct answer, it would be suitable for the query
expansion because it has close connections with both softbank and yahoo.

Example B Some examples where the newly found word was a clue to the question were also found.

Question

"Let me know events related to the peace plan of India and Pakistan."

Query expansion via the original method

{India → curry} and {Pakistan → Islamic}

Query expansion via the proposed method

{India and Pakistan → Kashmir}

Example B is a question that the system cannot answer in a single word; it is a non-factoid question. Kashmir is
an important word because it is an area that is close to both India and Pakistan. On the other hand, curry, a
word that is irrelevant to the question, was chosen via the original method. Such words cause semantic drift,
which sometimes makes it difficult to find documents that are relevant to the question. They were frequently
chosen via the original method, which decreased the performance of the system. We think that these cases did not
happen in the experiments by Berger et al. (2000) because they used relatively small and domain-specific corpora.

By contrast, the proposed method, in which the system chooses the words that maximize the mutual information
between two words from a question and one word from its answer, chose such words less frequently than the
original method. It enabled better document retrieval for QA that is not domain specific.



Example C There were some examples where the newly found words were irrelevant to the question.

Question

"Why did the price of crude oil increase in 2004?"

Query expansion via the original method

{increase → oil} and {year/age → marriage}

Query expansion via the proposed method

{increase and year/age → player/athlete} and {does and year/age → marriage}

Words that are irrelevant to the question were chosen 
for the query expansion even via the proposed method 
in Example C. There were many cases like them when 
general words were used for the calculation of mutual 
information. Therefore, we think that the words to calcu- 
late mutual information should be carefully selected in the 
future. 

Candidate answer evaluation 
Web relevance score 

We will discuss how the topic relevance scores from the Web contributed to the results. Examples D and E show
examples of the Web relevance score for factoid and non-factoid questions, respectively. The Web relevance scores
of the words in answers are shown in brackets. Those of the words in questions are omitted.

Example D The relevant words could be obtained via the Web relevance score for some factoid questions.

Question

"Where are the Olympic Games held in 2008?"

Example of answer

"Beijing"

The relevant words

Beijing (0.67), place (0.29), convention (0.29), 11 (0.24), 2011 (0.24), 12 (0.23), times (0.23), Japan (0.23),
Tokyo (0.22), represent (0.21), game (0.2), 20 (0.2), summer (0.19), winter (0.19), China (0.19), ...

Direct answers could be obtained when the question 
was factoid as shown in Example D. We could particu- 
larly obtain Beijing, which was related to both 2008 and 
Olympic, although we could hardly obtain these words 
via the method using only translation probability that can 
only take into consideration one word at a time. 
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Example E The relevant words could be obtained via 
the Web relevance score for some factoid questions. 

Question 

"(What are ES cells?)" 
Example of answer 

"(The research team of the Center for Developmental
Biology at the Institute of Physical and Chemical
Research in Kobe, led by Yoshiki Sasai, succeeded
in culturing large quantities of embryonic stem
cells (ES cells), which can become any type of
cell in the various tissues of the body, and in
effectively changing them into cerebral cells.)"

The relevant words 

sex (0.29), research (0.26), stem (0.24),
human (0.23), embryo (0.2), differentiation (0.18),
mouse or mice (0.18), remodeling or regeneration
(0.16), tissue (0.15), medical (0.13), like (0.13),
body (0.13), science (0.12), culture (0.11),
iPS (0.11), ...

Direct answers to the question could not be obtained
when the question was non-factoid. However, words related
to the question could be obtained, and suitable answers
that include these relevant words could also be obtained,
as shown in Example E. A remaining problem is that mutual
information could not distinguish words that appear
frequently in many documents from words that co-occur
with the content words in the question. Thus, we think
that selecting these words using IDF is worth trying in
the future.
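
The IDF-based selection mentioned above could look like the following sketch. It illustrates a possible direction, not an evaluated method: the smoothed IDF variant, the threshold `min_idf`, and the toy documents are assumptions, while the two candidate scores are taken from Example E.

```python
import math


def idf(word, documents):
    """Inverse document frequency, with one added to the document count
    so that unseen words do not cause a division by zero."""
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + df))


def filter_by_idf(candidates, documents, min_idf):
    """Keep (word, score) candidates whose IDF is at least min_idf."""
    return [(w, s) for w, s in candidates if idf(w, documents) >= min_idf]


# Hypothetical documents, each represented as a set of words.
documents = [
    {"cell", "research", "like"},
    {"like", "tokyo"},
    {"like", "stem"},
    {"like", "game"},
]
# Candidate words with the Web relevance scores listed in Example E.
candidates = [("like", 0.13), ("stem", 0.24)]
selected = filter_by_idf(candidates, documents, min_idf=0.5)
```

In this toy setting, a word that occurs in every document receives a low IDF and is dropped, whereas a word that occurs in few documents is kept.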

Translation probability 

We discuss how the translation probability contributed
to the results.

Table 3 shows examples of the top-5 words that maximize
P(q|a), the translation probability from a word a in an
answer to a word q in a question, given a. The English
words and the numbers in brackets are the English
translations and the translation probabilities, respectively.



For example, when the word for "medical care" was given
as a word in an answer, it tended to be translated into
the words for "medical care", "hospital", "fare", "medical
admission", and "operation" in its question. This
indicates that "medical care" tends to appear in the
answer when these words appear in its question. English
words written in upper case denote the grammatical
functions of Japanese words.
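
Reading the top-5 words off a learned translation table can be sketched as follows. The nested-dictionary layout and the function name `top_translations` are hypothetical; the probabilities are those listed for row (1) of Table 3, keyed by their English translations.

```python
def top_translations(table, answer_word, k=5):
    """Return the k question words q with the highest P(q | answer_word).

    table: {answer_word: {question_word: probability}} (assumed layout).
    """
    candidates = table.get(answer_word, {})
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Row (1) of Table 3, with words replaced by their English translations.
table = {
    "medical care": {
        "operation": 0.024,
        "hospital": 0.057,
        "medical care": 0.064,
        "admission": 0.026,
        "fare": 0.037,
    }
}
top2 = top_translations(table, "medical care", k=2)
```

An answer word that is missing from the table simply yields an empty list, which mirrors the case of words too rare to be learned.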

First, Table 3 shows that words in answers are likely to
be translated into themselves in their questions. This
indicates that words in questions tend to appear in their
answers. Next, the table shows that words in answers are
likely to be translated into their relevant words and
synonyms: for (1) "medical care", the words for "medical
admission" and "operation" are listed, and for (11) "prime
minister", a synonym that also means "prime minister" is
listed. This indicates that relevant words and synonyms
of the words in a question tend to appear in its answers.

The relevant words and synonyms obtained using the
translation probability have different properties from
those obtained from 100 Web documents, because the former
were learned from approximately one million examples in a
Q&A site corpus. Therefore, we think that the performance
of the QA system improved because the Web relevance score
and the translation probability complemented each other.

We expected that (13) "because", (14) "because, from",
and (15) "because, for", which often appear in answers,
would be likely to co-occur with the two Japanese words
for "why" that often appear in questions, but they were
not. We think that this is because particles such as (14)
"because, from" and (15) "because, for" are ambiguous.
Soricut and Brill (2006), who used an English Q&A corpus
for learning, reported that "because" tended to be
translated into "why". We think that their method worked
well because the English word "because" is less ambiguous
than these Japanese words.

However, (16) "reason", which is also likely to appear
in answers to why-type questions, was learned as a word
that tends to be translated into "why". This indicates
that learning with the translation probability can
partially evaluate the writing style.

In addition, words that appeared only a few times tended
to be learned incorrectly. For example, (12)
"shepherd's-purse" was hardly translated into relevant
words because it appeared only twice in the Q&A site
corpus. Moreover, some unsuitable words were chosen
because the translation probabilities depended only on
the Q&A site corpus. The "Yahoo! Chiebukuro" data consist
of Q&A site posts submitted from April 1st, 2004 to
October 31st, 2005. Therefore, "Koizumi", who was
Table 3 Examples of translation probability

Index  Given word        1st                    2nd                 3rd                  4th                  5th
(1)    Medical care      Medical care (.064)    Hospital (.057)     Fare (.037)          Admission (.026)     Operation (.024)
(2)    Lawsuit           Lawsuit (.097)         Judgment (.032)     Sue over (.021)      Advocate (.016)      Right (.016)
(3)    Salt water        Salt water (.081)      Water (.034)        Tastes salty (.025)  Method (.023)        Shell (.022)
(4)    Landform          Landform (.033)        Yokohama (.020)     Times (.019)         Typhoon (.015)       Geography (.015)
(5)    Channel           Channel (.115)         Island (.0373)      World (.024)         Takeshima (.024)     Tohoku (.021)
(6)    Arrestment        Arrestment (.249)      PASSIVE (.060)      Get caught (.031)    Do (.028)            PAST (.023)
(7)    Cell              Cell (.132)            ? (.031)            PREDICATION (.030)   Why (.029)           Human (.026)
(8)    Information       Information (.146)     Of (.060)           PREDICATION (.031)   Do (.031)            AGENT (.030)
(9)    Technology        Technology (.134)      Of (.065)           TOPIC MARKER (.040)  PREDICATION (.039)   ? (.038)
(10)   President         President (.298)       America (.098)      Bush (.039)          Of (.025)            ? (.021)
(11)   Prime minister    Prime minister (.222)  Koizumi (.138)      Minister (.047)      Mr. (.040)           Prime minister (.030)
(12)   Shepherd's-purse  Seven herbs (.191)     ? (.170)            As for (.143)        Of (.079)            QUESTION (.045)
(13)   Because           ? (.064)               Of (.063)           PREDICATION (.052)   TOPIC MARKER (.049)  QUESTION (.046)
(14)   Because           Of (.073)              PREDICATION (.058)  ? (.056)             QUESTION (.050)      AGENT (.046)
(15)   For               Of (.088)              PREDICATION (.061)  ? (.055)             QUESTION (.052)      For (.048)
(16)   Reason            Reason (0.17)          Why (0.04)          AGENT (0.04)         Of (0.04)            QUESTION (0.03)


This table shows examples of the top-5 words that maximize P(q|a), the translation probability from a word a in an answer to a word q in a question, given a.
Each entry gives the English translation of a word and, in brackets, its translation probability. English words written in upper case denote the grammatical
functions of Japanese words.



the prime minister at that time, and "Bush", who was the
president of the USA at that time, were chosen as words
likely to be translated from "prime minister" and
"president", respectively.

Conclusion 

Question Answering (QA) is the task of answering natural
language questions with adequate sentences. It includes
the relevant document retrieval and candidate answer
evaluation modules. This paper proposed two methods to
improve the performance of a QA system using a Q&A
site corpus. The first method is for the query expansion
in the relevant document retrieval module. We proposed
a modified measure of mutual information for the query
expansion: it is calculated between two words in each
question and a word in its answer in the Q&A site corpus,
so that unsuitable words are not chosen. The second
method is for the candidate answer evaluation module. We
proposed evaluating candidate answers using two existing
measures together, i.e., the Web relevance score and the
translation probability. We showed that the proposed
method evaluated the candidate answers more effectively
than the original methods. The experiments were carried
out using a Japanese Q&A site corpus. They revealed that
the first method was significantly better than the
original method in both accuracy and MRR. They also
showed that the second method was significantly better
than the original methods in MRR.
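
For reference, MRR (Mean Reciprocal Rank) can be computed as in the following minimal sketch. The input format, a list holding the 1-based rank of the first correct answer for each question (or `None` when no correct answer was returned), is an assumption made for illustration, and the ranks in the example are hypothetical.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Average of 1/rank of the first correct answer over all questions.

    first_correct_ranks: 1-based rank per question, or None when no
    correct answer was returned (such questions contribute 0).
    """
    if not first_correct_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_correct_ranks if rank is not None)
    return total / len(first_correct_ranks)


# Hypothetical ranks for four questions.
mrr = mean_reciprocal_rank([1, 2, None, 4])
```

Unlike accuracy at the top rank, this measure also rewards a system that places the correct answer second or third, which is why the two measures can lead to different conclusions.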

Endnotes 

a Berger et al. (2000) used Usenet FAQ documents and 
customer service call-center dialogues from a large retail 
company. 

b This word was obtained because there is a baseball
team named the SoftBank Hawks in Japan.

c P(q_j|a_i) was summed over i from 1 to l + 1 because
each question word had exactly one connection to either a
single answer word or the empty word.
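
The summation in endnote c can be sketched as follows, reading it as a sum of P(q | a_i) over the l answer words plus one empty word. The uniform weight 1/(l + 1), borrowed from IBM Model 1, the `NULL` token, the function name, and the probabilities in the example are all assumptions of this sketch and not values from the experiments.

```python
NULL = "<empty>"  # hypothetical token standing for the empty answer word


def question_word_prob(q, answer_words, table):
    """Probability of question word q given an answer of l words.

    Sums P(q | a_i) over the l answer words plus the empty word
    (l + 1 terms), each weighted uniformly by 1 / (l + 1).
    table: {answer_word: {question_word: probability}} (assumed layout).
    """
    words = list(answer_words) + [NULL]
    return sum(table.get(a, {}).get(q, 0.0) for a in words) / len(words)


# Hypothetical translation probabilities.
table = {
    "beijing": {"olympic": 0.2},
    "china": {"olympic": 0.1},
    NULL: {"olympic": 0.03},
}
p = question_word_prob("olympic", ["beijing", "china"], table)
```

The empty word lets a question word that has no counterpart in the answer still receive a non-zero probability.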